12 research outputs found

    Energy-efficient and high-performance lock speculation hardware for embedded multicore systems

    Full text link
    Embedded systems are becoming increasingly common in everyday life and, like their general-purpose counterparts, they have shifted towards shared-memory multicore architectures. However, they are much more resource constrained, and as they often run on batteries, energy efficiency becomes critically important. In such systems, achieving high concurrency is a key requirement for delivering satisfactory performance at low energy cost. In order to achieve this high concurrency, consistency across the shared memory hierarchy must be accomplished in a cost-effective manner in terms of performance, energy, and implementation complexity. In this article, we propose Embedded-Spec, a hardware solution for supporting transparent lock speculation, without the requirement for special supporting instructions. Using this approach, we evaluate the energy consumption and performance of a suite of benchmarks, exploring a range of contention management and retry policies. We conclude that for resource-constrained platforms, lock speculation can provide real benefits in terms of improved concurrency and energy efficiency, as long as the underlying hardware support is carefully configured. This work is supported in part by NSF under Grants CCF-0903384, CCF-0903295, CNS-1319495, and CNS-1319095, as well as by the Semiconductor Research Corporation under grant number 1983.001. (CCF-0903384 - NSF; CCF-0903295 - NSF; CNS-1319495 - NSF; CNS-1319095 - NSF; 1983.001 - Semiconductor Research Corporation)
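    The abstract mentions exploring contention management and retry policies; the following minimal sketch (a software analogue, not the paper's hardware mechanism) illustrates one such policy: a bounded number of speculative attempts with exponential backoff before falling back to the conventional lock. The abort probability, retry limit, and backoff constants are invented for illustration.

```c
/* Hypothetical sketch of a bounded-retry policy with exponential backoff.
 * The speculative attempt is modelled by a stub that aborts with some
 * probability; real lock speculation would be provided by hardware. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_RETRIES 4

/* Stub: models one speculative execution of the critical section.
 * Returns true on commit, false on abort (e.g., a data conflict). */
static bool try_speculative_execution(void) {
    return (rand() % 100) < 70;          /* assume ~70% of attempts commit */
}

/* Fallback: run the critical section under the conventional lock. */
static void run_with_lock(void) {
    puts("fell back to the conventional lock");
}

static void critical_section(void) {
    unsigned backoff = 1;
    for (int attempt = 0; attempt < MAX_RETRIES; ++attempt) {
        if (try_speculative_execution()) {
            puts("committed speculatively");
            return;
        }
        /* Contention management: back off before retrying so that
         * conflicting cores do not immediately collide again. */
        for (volatile unsigned i = 0; i < backoff * 1000; ++i) { }
        backoff <<= 1;
    }
    run_with_lock();
}

int main(void) {
    for (int i = 0; i < 5; ++i)
        critical_section();
    return 0;
}
```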

    Evaluating critical bits in arithmetic operations due to timing violations

    Full text link
    Various error models are used in simulations of voltage-scaled arithmetic units to examine application-level tolerance of timing violations. The selection of an error model deserves careful consideration, as differences among error models drastically affect application behavior. In particular, floating-point arithmetic units (FPUs) have architectural characteristics that shape their error behavior. We examine the architecture of FPUs and design a new error model, which we call Critical Bit. We run selected benchmark applications with the Critical Bit model and other widely used error injection models to demonstrate the differences.
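    To illustrate why the position of an injected bit error matters so much, the sketch below flips a single bit of an IEEE-754 double, contrasting a flip in the exponent field with a flip in a low-order mantissa bit. This is not the paper's Critical Bit model itself, only a minimal demonstration of how bit position drives error magnitude.

```c
/* Hypothetical sketch of bit-level error injection into an IEEE-754 double. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flip one bit of a double's binary representation. */
static double flip_bit(double x, int bit) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits ^= (uint64_t)1 << bit;
    memcpy(&x, &bits, sizeof x);
    return x;
}

int main(void) {
    double x = 3.141592653589793;
    /* Bit 62 sits in the exponent field; bit 2 is a low-order mantissa bit. */
    printf("original      : %.15g\n", x);
    printf("exponent flip : %.15g\n", flip_bit(x, 62));   /* drastic change */
    printf("mantissa flip : %.15g\n", flip_bit(x, 2));    /* negligible change */
    return 0;
}
```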

    Edge-TM: Exploiting transactional memory for error tolerance and energy efficiency

    No full text
    Scaling of semiconductor devices has enabled higher levels of integration and performance improvements at the price of making devices more susceptible to the effects of static and dynamic variability. Adding safety margins (guardbands) on the operating frequency or supply voltage prevents timing errors, but has a negative impact on performance and energy consumption. We propose Edge-TM, an adaptive hardware/software error management policy that (i) optimistically scales the voltage beyond the edge of safe operation for better energy savings and (ii) works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism. The policy applies dynamic voltage scaling (DVS) (while keeping the frequency fixed) based on the feedback provided by the HTM, which makes it simple and generally applicable. Experiments on an embedded platform show our technique is capable of a 57% energy improvement compared to using voltage guardbands, and an extra 21-24% improvement over existing state-of-the-art error tolerance solutions, at a nominal area and time overhead.
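    A minimal sketch of the feedback idea described above, assuming invented step sizes and an invented abort model: lower the supply voltage step by step while HTM transactions commit cleanly, and raise it again as soon as aborts signal timing errors.

```c
/* Hypothetical DVS feedback loop driven by transaction outcomes.
 * The voltage levels, step size, and abort model are invented. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static double voltage = 1.00;                    /* guardbanded starting level */
static const double V_MIN = 0.70, V_STEP = 0.01;

/* Stub: aborts become more likely as voltage drops below ~0.85 V. */
static bool transaction_aborted(void) {
    double p = voltage < 0.85 ? (0.85 - voltage) * 10.0 : 0.0;
    return ((double)rand() / RAND_MAX) < p;
}

int main(void) {
    for (int txn = 0; txn < 200; ++txn) {
        if (transaction_aborted()) {
            voltage += V_STEP;                   /* recover, back off from the edge */
        } else if (voltage - V_STEP >= V_MIN) {
            voltage -= V_STEP;                   /* no error: try to save more energy */
        }
    }
    printf("settled near %.2f V\n", voltage);
    return 0;
}
```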

    Playing with fire: Transactional memory revisited for error-resilient and energy-efficient MPSOC execution

    No full text
    As silicon integration technology pushes toward atomic dimensions, errors due to static and dynamic variability are an increasing concern. To avoid such errors, designers often turn to "guardband" restrictions on the operating frequency and voltage. If guardbands are too conservative, they limit performance and waste energy, but less conservative guardbands risk moving the system closer to its Critical Operating Point (COP), a frequency-voltage pair that, if surpassed, causes massive instruction failures. In this paper, we propose a novel scheme that allows the system to dynamically adjust to an evolving COP and operate at highly reduced margins, while guaranteeing forward progress. Specifically, our scheme dynamically monitors the platform and adaptively adjusts to the COP across multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory (HTM) for error recovery. Experiments demonstrate that our technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient MPSoC execution.
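    As a rough software analogue of the checkpoint and roll-back mechanism borrowed from HTM, the sketch below snapshots state before each work chunk and, on a simulated timing error, restores it and re-executes; setjmp/longjmp stand in for the hardware checkpoint, and the error probability and chunk structure are invented.

```c
/* Hypothetical checkpoint/roll-back loop using setjmp/longjmp. */
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf checkpoint;

/* Stub: models an error detector flagging a timing violation. */
static bool timing_error_detected(void) {
    return (rand() % 10) == 0;           /* assume ~10% of chunks fail */
}

static void run_chunk(int id) {
    if (timing_error_detected())
        longjmp(checkpoint, 1);          /* roll back to the last checkpoint */
    printf("chunk %d committed\n", id);
}

int main(void) {
    for (volatile int id = 0; id < 8; ++id) {
        if (setjmp(checkpoint) != 0) {
            /* Roll-back path: retreat to a safer frequency/voltage pair
             * before re-executing, so the chunk can make forward progress. */
            printf("chunk %d rolled back, retrying at a safer point\n", id);
        }
        run_chunk(id);
    }
    return 0;
}
```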

    Thrifty-malloc: a dynamic memory manager for embedded multicore systems with hardware transactional memory

    Get PDF
    This article presents thrifty-malloc: a dynamic memory manager compatible with hardware transactional memory (HTM), for embedded multicore systems. The manager combines modularity, ease of use, and HTM compatibility in a lightweight, memory-frugal design. Thrifty-malloc is easy to deploy and configure for non-expert programmers. It delivers good performance with low memory overhead for embedded applications with a high degree of parallelism running on many-core architectures. In addition, the transparent mechanisms that increase the manager's resilience to unpredictable dynamic situations incur only a small timing overhead.

    Speculative synchronization for coherence-free embedded NUMA architectures

    No full text
    High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access (NUMA) costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware-managed caches that require some form of cache coherence management. These "coherence-free" systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this paper, we present a new scheme for hardware transactional memory support within a cluster-based NUMA system that lacks an underlying cache-coherence protocol. To the best of our knowledge, this is the first design for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our design can achieve significant performance improvements over traditional lock-based schemes.

    Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures

    No full text
    High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware-managed caches that require some form of cache coherence management. These "coherence-free" systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging, and we conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes.
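    The paper's Full-Mirroring and Distributed Logging schemes are hardware designs; the sketch below only illustrates the underlying data-versioning idea in software, using an undo log: each transactional write records the old value, an abort replays the log in reverse, and a commit simply discards it. All names and sizes here are invented.

```c
/* Hypothetical undo-log data versioning for a software transaction. */
#include <stdio.h>

#define LOG_CAPACITY 64

struct undo_entry { int *addr; int old_value; };
static struct undo_entry undo_log[LOG_CAPACITY];
static int log_len = 0;

/* Transactional store: save the old value, then perform the write. */
static void tx_write(int *addr, int value) {
    undo_log[log_len].addr = addr;
    undo_log[log_len].old_value = *addr;
    ++log_len;
    *addr = value;
}

/* Abort: walk the log backwards and restore every overwritten value. */
static void tx_abort(void) {
    while (log_len > 0) {
        --log_len;
        *undo_log[log_len].addr = undo_log[log_len].old_value;
    }
}

/* Commit: the new values are already in place, so just drop the log. */
static void tx_commit(void) { log_len = 0; }

int main(void) {
    int a = 1, b = 2;
    tx_write(&a, 10);
    tx_write(&b, 20);
    tx_abort();                                   /* conflict detected: roll back */
    printf("after abort : a=%d b=%d\n", a, b);    /* prints 1 2 */
    tx_write(&a, 10);
    tx_commit();
    printf("after commit: a=%d b=%d\n", a, b);    /* prints 10 2 */
    return 0;
}
```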

    Thrifty-malloc: A HW/SW codesign for the dynamic management of hardware transactional memory in embedded multicore systems

    No full text
    We present thrifty-malloc: a transaction-friendly dynamic memory manager for high-end embedded multicore systems. The manager combines modularity, ease of use, and hardware transactional memory (HTM) compatibility in a lightweight and memory-efficient design. Thrifty-malloc is easy to deploy and configure for non-expert programmers, yet provides good performance with low memory overhead for highly parallel embedded applications running on massively parallel processor arrays (MPPAs) or many-core architectures. In addition, the transparent mechanisms that increase our manager's resilience to unpredictable dynamic situations incur a low timing overhead in comparison to established techniques.
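    One common way to make an allocator transaction-friendly is to confine allocation inside a transaction to thread-local state, so that no shared allocator metadata can cause conflicts and aborts. Whether thrifty-malloc does exactly this is described in the paper; the hypothetical sketch below only shows that general principle with a per-thread bump-pointer pool that would be refilled outside of any transaction.

```c
/* Hypothetical per-thread pool allocator; names and sizes are invented. */
#include <stddef.h>
#include <stdio.h>

#define POOL_BYTES 4096

struct thread_pool {
    unsigned char buf[POOL_BYTES];
    size_t used;
};

/* One pool per thread; _Thread_local (C11) keeps it private to the caller. */
static _Thread_local struct thread_pool pool;

/* Safe to call inside a transaction: bump-pointer, thread-local only.
 * A real allocator would also round sizes up to keep allocations aligned. */
static void *tx_alloc(size_t size) {
    if (pool.used + size > POOL_BYTES)
        return NULL;                 /* pool exhausted: refill outside the txn */
    void *p = pool.buf + pool.used;
    pool.used += size;
    return p;
}

int main(void) {
    int *x = tx_alloc(sizeof *x);
    if (x) { *x = 42; printf("allocated %d from the private pool\n", *x); }
    return 0;
}
```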

    Transparent and energy-efficient speculation on NUMA architectures for embedded MPSoCs (Proceedings of the First International Workshop on Many-core Embedded Systems - MES '13)

    No full text
    High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, and home entertainment centers are becoming ubiquitous. Like their general-purpose counterparts, and for many of the same energy-related reasons, embedded systems are turning to multicore architectures. Moreover, as the demand for more compute-intensive capabilities for embedded systems increases, these multicore architectures will evolve into many-core systems for improved performance or performance/area/Watt. These systems are often organized as cluster-based Non-Uniform Memory Access (NUMA) architectures that provide the programmer with a shared-memory abstraction, with the cost of sharing memory (in terms of performance, energy, and complexity) varying substantially depending on the locations of the communicating processes. This paper investigates one of the principal challenges presented by these emerging NUMA architectures for embedded systems: providing efficient, energy-effective, and convenient mechanisms for synchronization and communication. In this paper, we propose an initial solution based on hardware support for speculative synchronization.